Beyond Markovian: Bayes-Adaptive RL for Reflective Exploration in LLMs

BARL: More Efficient and Adaptive Reasoning in Large Language Models

Authors: S. Zhang et al.
Published on Arxiv: 2025-05-26
Link: http://arxiv.org/abs/2505.20561v1
Institutions: Northwestern University • Google DeepMind • Google
Keywords: large language models, reinforcement learning, Bayes-adaptive RL, reflective exploration, reasoning, Markov decision process, token efficiency, mathematical reasoning, Bayesian inference, self-reflection, generalization, GRPO, progress reward, Chain-of-Thought, Qwen2.5-Math, DeepSeek-R1-Distill-Llama, GSM8K, MATH, CollegeMath, OlympiadBench

Large Language Models (LLMs) have demonstrated notable reasoning abilities and emergent reflective behaviors, especially when fine-tuned with reinforcement learning (RL). However, traditional Markovian RL frameworks confine exploration to the training phase and condition the policy only on the current state, ignoring the history that reflective reasoning could exploit at inference time. This gap motivates both optimizing and better understanding reflective exploration, with the aim of improving generalization and reasoning efficiency in LLMs.
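The Markovian-versus-history-dependent distinction can be made concrete with a toy example. The sketch below (an illustration of the general Bayes-adaptive idea, not the paper's algorithm; the two-armed bandit and its hypothesis probabilities are invented for this example) maintains a posterior over candidate reward hypotheses and acts on the posterior-weighted expected reward, so the policy's choice depends on the whole observation history rather than only the current state:

```python
# Toy 2-armed bandit whose true reward probabilities are one of two
# hypotheses. A Markovian policy sees only the current state and cannot
# adapt; a Bayes-adaptive policy conditions on the full history through
# a posterior over hypotheses. (Hypothetical numbers, not from the paper.)

HYPOTHESES = [
    {"arm0": 0.9, "arm1": 0.1},  # hypothesis A: arm0 is the good arm
    {"arm0": 0.1, "arm1": 0.9},  # hypothesis B: arm1 is the good arm
]

def posterior_update(prior, arm, reward):
    """Bayes rule: P(h | obs) is proportional to P(obs | h) * P(h)."""
    likelihoods = [h[arm] if reward == 1 else 1 - h[arm] for h in HYPOTHESES]
    unnorm = [lik * p for lik, p in zip(likelihoods, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def bayes_adaptive_action(posterior):
    """Pick the arm with the highest posterior-weighted expected reward."""
    values = {
        arm: sum(p * h[arm] for p, h in zip(posterior, HYPOTHESES))
        for arm in ("arm0", "arm1")
    }
    return max(values, key=values.get)

# Start from a uniform prior; observe arm0 fail twice.
post = [0.5, 0.5]
post = posterior_update(post, "arm0", 0)
post = posterior_update(post, "arm0", 0)

# The posterior now strongly favors hypothesis B, so the policy switches
# to arm1 -- the same kind of strategy switching BARL formalizes for LLMs.
print(bayes_adaptive_action(post))  # → arm1
```

A purely Markovian policy in this setting would keep playing whichever arm its fixed state-conditioned rule prefers; the history-dependent policy switches strategies once the evidence contradicts its initial hypothesis.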

Building on this context, the authors propose a Bayes-Adaptive RL (BARL) approach, with the following key contributions:

- Recasting LLM reasoning as a Bayes-adaptive RL problem: the policy maximizes expected return under a posterior over candidate MDPs, conditioning on the full interaction history rather than only the current state.
- Showing that reflective behaviors such as hypothesis testing and strategy switching emerge naturally from this objective, giving a principled account of when and why self-reflection helps.
- A practical BARL algorithm that stitches together plausible candidate strategies, weighting them by their posterior consistency with observed outcomes, so the model learns when to persist with a strategy and when to switch.
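The core objective can be sketched compactly: instead of the value under a single known MDP, the Bayes-adaptive value is the posterior-weighted average over candidate MDPs. The snippet below is a minimal numeric illustration of that weighting (the function name and the toy values are invented for this example):

```python
def bayes_adaptive_value(posterior, values_per_mdp):
    """Expected return under a posterior over candidate MDPs:
    V(history) = sum over M of P(M | history) * V_M.
    """
    assert abs(sum(posterior) - 1.0) < 1e-9, "posterior must be normalized"
    return sum(p * v for p, v in zip(posterior, values_per_mdp))

# Toy numbers: two hypothesized MDPs (e.g. two candidate solution
# strategies being the correct one) with per-MDP values 1.0 and 0.2.
value = bayes_adaptive_value([0.7, 0.3], [1.0, 0.2])  # ≈ 0.76
```

As new evidence shifts the posterior toward one hypothesis, the weighted value shifts with it, which is what makes exploration (gathering evidence that sharpens the posterior) valuable under this objective.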

Moving from the methodology to results, the evaluation shows consistent gains for BARL over Markovian RL baselines:

- Higher accuracy than Markovian RL baselines such as GRPO and progress-reward variants on mathematical reasoning benchmarks, including GSM8K, MATH, CollegeMath, and OlympiadBench, using Qwen2.5-Math and DeepSeek-R1-Distill-Llama base models.
- Better test-time token efficiency: BARL reaches comparable or better accuracy with fewer generated tokens, since its exploration is targeted at reducing uncertainty rather than reflexively re-deriving.

Taken together, these results support the following conclusions about the benefits of BARL: